Implement random_sample() #24492

bushshrub · 2022-05-05T03:16:15Z

Why are these changes needed?

Addresses issue #24449

A random_sample() API was added to datasets.

Related issue number

#24449

Notes

Might be good to add some unit tests to ensure everything is nice and reliable. I'll try to work on these later.

Checks

- I've run scripts/format.sh to lint the changes in this PR.
- I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
- Testing Strategy
  - Unit tests
  - Release tests
  - This PR is not tested :(

python/ray/data/dataset.py

bushshrub · 2022-05-05T03:57:33Z

~~Have yet to test against pandas DataFrames yet.~~

~~Note: Currently fails against range_arrow~~

jianoaix

Thanks for making this contribution!

python/ray/data/dataset.py

ericl · 2022-05-06T03:10:35Z

I'm looking at Spark's sample function https://medium.com/udemy-engineering/pyspark-under-the-hood-randomsplit-and-sample-inconsistencies-examined-7c6ec62644bc . If we just take a fraction, it makes it much simpler to implement sample, since the same sample function can be applied to each block regardless of size. We could also make sample return another Dataset instead of rows directly, to make it more scalable.

On Thu, May 5, 2022, 8:04 PM Jian Xiao wrote:

> In python/ray/data/dataset.py:
>
> > +            raise ValueError("Cannot from an empty dataset")
> > +
> > +        if number < 1:
> > +            raise ValueError("Cannot sample less than 1 element.")
> > +
> > +        count = self._meta_count()
> > +
> > +        if number > count:
> > +            raise ValueError(
> > +                "Cannot sample more elements than there are in the dataset"
> > +            )
> > +
> > +        if seed:
> > +            random.seed(seed)
> > +
> > +        n_required = number // self.num_blocks()
>
> One potential algorithm is reservoir sampling (https://en.wikipedia.org/wiki/Reservoir_sampling) which can just linear scan each block (and doesn't need to know how many rows in the block or in dataset).
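For readers unfamiliar with the reservoir sampling suggestion, here is a minimal sketch in plain Python. It is not the code from this PR: `reservoir_sample` and the toy `blocks` list are hypothetical names used only for illustration, and ordinary lists stand in for Dataset blocks. It shows the property mentioned in the comment, namely that a single linear scan suffices and neither the block sizes nor the total row count has to be known up front.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(rows: Iterable[T], k: int, seed: Optional[int] = None) -> List[T]:
    """Return up to k rows chosen uniformly at random from a stream of unknown length."""
    if k < 1:
        raise ValueError("Cannot sample less than 1 element.")
    rng = random.Random(seed)  # local RNG, so the global random state is untouched
    reservoir: List[T] = []
    for i, row in enumerate(rows):
        if i < k:
            # The first k rows fill the reservoir directly.
            reservoir.append(row)
        else:
            # Row i (0-based) replaces a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = row
    return reservoir


# Hypothetical blocks of different sizes, scanned back to back as one stream.
blocks = [[0, 1, 2], [3], [4, 5, 6, 7, 8, 9]]
sample = reservoir_sample((row for block in blocks for row in block), k=4, seed=0)
```

If the stream holds fewer than `k` rows, this sketch simply returns all of them. Passing the seed to a local `random.Random` also means a seed of 0 is honored, which a truthiness check such as `if seed:` would skip.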

bushshrub · 2022-05-06T03:22:27Z

How would the blocks be assigned if a new dataset were to be returned?

ericl · 2022-05-06T03:28:42Z

You could .map_batches(sample(0.6)) (in pseudocode) to implement the Spark sampling strategy. This would return a Dataset with the same number of blocks, where each block is 60% in size of the original (approximately).

It would then be up to the user to take / iterate over the downsampled dataset, which would give maximum flexibility. What do you think?
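A rough sketch of the fraction-based idea, under stated assumptions: `make_fraction_sampler` and `apply_per_block` are hypothetical helpers, and plain lists stand in for blocks, so this is not the Ray implementation. Each row is kept independently with probability `fraction`, so the same function works on a block of any size and the output has the same number of blocks as the input.

```python
import random
from typing import Callable, List, Optional, Sequence, TypeVar

T = TypeVar("T")


def make_fraction_sampler(
    fraction: float, seed: Optional[int] = None
) -> Callable[[Sequence[T]], List[T]]:
    """Build a per-batch function that keeps each row with probability `fraction`."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    rng = random.Random(seed)

    def sample_batch(batch: Sequence[T]) -> List[T]:
        # One independent keep/drop decision per row; no sizes are needed.
        mask = [rng.random() < fraction for _ in batch]
        return [row for row, keep in zip(batch, mask) if keep]

    return sample_batch


def apply_per_block(blocks: List[List[T]], fn: Callable[[Sequence[T]], List[T]]) -> List[List[T]]:
    # Hypothetical stand-in for mapping a function over every block of a dataset.
    return [fn(block) for block in blocks]


blocks = [list(range(0, 1000)), list(range(1000, 1500)), list(range(1500, 1505))]
sampled_blocks = apply_per_block(blocks, make_fraction_sampler(0.6, seed=42))
```

The output sizes are only approximately 60% of the inputs, and a very small block can come back empty. In a distributed setting each block would likely be processed in its own task, so sharing one seeded generator across blocks as done here is a simplification.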

bushshrub · 2022-05-06T03:30:32Z

Sounds like a good idea. Should I implement that as an extension to the current random_sample function, similar to what pandas does with its .sample?

It would definitely be easier to simply cut the dataset down to x% at random.

bushshrub · 2022-05-12T03:59:20Z

Thanks for the feedback! I didn't catch the failing tests since I only ran the tests for random sample. My bad.

clarkzinzow

LGTM overall, mainly just a perf nit!

python/ray/data/dataset.py

clarkzinzow · 2022-05-13T01:39:47Z

python/ray/data/tests/test_dataset.py

+    ensure_sample_size_close(ds)
+    # Small datasets
+    ds1 = ray.data.range(5, parallelism=5)
+    ensure_sample_size_close(ds1)


Nice tests!

Thank you for the feedback!
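As an aside on what a "sample size is close" check could look like for fraction-based sampling: the sketch below is an assumption, not the `ensure_sample_size_close` helper from the PR, whose body is not shown here. With n rows each kept with probability p, the sample size is binomial with mean n·p and standard deviation sqrt(n·p·(1−p)), so a tolerance of a few standard deviations keeps the check from being flaky on both large and very small datasets.

```python
import math
import random


def sample_size_is_close(n_rows: int, n_sampled: int, fraction: float, z: float = 6.0) -> bool:
    """True if n_sampled lies within z standard deviations of n_rows * fraction."""
    expected = n_rows * fraction
    std = math.sqrt(n_rows * fraction * (1 - fraction))
    return abs(n_sampled - expected) <= z * std


# Simulate per-row sampling on a large and a very small dataset.
rng = random.Random(1234)
checks = []
for n_rows in (1000, 5):
    n_sampled = sum(rng.random() < 0.5 for _ in range(n_rows))
    checks.append(sample_size_is_close(n_rows, n_sampled, 0.5))
```

The `z=6.0` default is an arbitrary, deliberately loose choice; a tighter bound catches more bugs but fails more often by chance.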

clarkzinzow

Awesome, great work! 🙌

ericl · 2022-05-16T00:54:04Z

FAILED ::test_parquet_read_spread - RuntimeError: Maybe you called ray.init t...

This one is still failing unfortunately. I think the fix is simple: move the unit tests up a few lines to 3426, to be next to the other ray_start_regular_shared tests.

bushshrub · 2022-05-16T09:27:24Z

Sure, I'll take care of that.

bushshrub · 2022-05-17T10:10:02Z

Fixed!

bushshrub added 3 commits May 5, 2022 11:11

Build random_sample feature (ray-project#24449)

f0e40b8

Ran scripts/format.sh

ade5959

Merge branch 'ray-project:master' into master

6c67d1c

bushshrub requested review from ericl, scv119, clarkzinzow and jjyao as code owners May 5, 2022 03:16

bushshrub changed the title ~~Build random_sample feature (#24449)~~ Implement random_sample() May 5, 2022

ericl reviewed May 5, 2022

python/ray/data/dataset.py (outdated, resolved)

ericl assigned clarkzinzow and jianoaix May 5, 2022

bushshrub added 2 commits May 5, 2022 11:56

Make random_sample more random

d54f8ff

Merge remote-tracking branch 'origin/master'

dcd8602

bushshrub added 4 commits May 5, 2022 12:03

Add some dataset validity checks

93f9dfb

Fix random_sample() for non-list types

098426c

Account for possibly attempting to sample n from a batch for x elements, n > x

72da35d

Run scripts/format.sh

d76fb07

jianoaix reviewed May 5, 2022

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/dataset.py (outdated, resolved)

Update sampling algorithm, updated documentation to explain sampling method

61791dd

Run format script

c7b9f55

Add a random_shuffle on the sample_population since .take() will leave it less random

cfa9291

bushshrub added 4 commits May 6, 2022 11:44

run format script

52d3063

Add tests for random sampling

b970866

Merge branch 'ray-project:master' into master

4d903af

Merge remote-tracking branch 'origin/master'

aa57653

ericl self-assigned this May 12, 2022

bushshrub and others added 2 commits May 12, 2022 11:58

Fix failing test

01d922f

Co-authored-by: Eric Liang

Fix failing test ray-project#2

ef4bcae

Co-authored-by: Eric Liang

jianoaix reviewed May 12, 2022

python/ray/data/tests/test_dataset.py (outdated, resolved)

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/tests/test_dataset.py (outdated, resolved)

bushshrub added 6 commits May 12, 2022 13:52

Merge branch 'ray-project:master' into master

a6885d1

Break up ValueError assertions

9aa8337

Update documentation to reflect the number of items being returned

17e8e5c

Add handling to address len(batch) * fraction < 1

46290c1

Test for 46290c1

c9c6eb3

Run the format script

67fac90

jianoaix reviewed May 12, 2022

python/ray/data/dataset.py (outdated, resolved)

python/ray/data/tests/test_dataset.py (outdated, resolved)

python/ray/data/tests/test_dataset.py (outdated, resolved)

python/ray/data/tests/test_dataset.py (outdated, resolved)

bushshrub added 4 commits May 13, 2022 09:04

Resolve minor issues

18b90c5

Always use strategy = 1

c64d4b5

Remove unused import math

b1b45c5

Merge branch 'ray-project:master' into master

f772686

clarkzinzow reviewed May 13, 2022

bushshrub and others added 2 commits May 13, 2022 10:17

Explain mask generation

a4cbbde

This utilizes more concise terminology for the generation of the mask

Co-authored-by: Clark Zinzow

Performance improvements for pyarrow

a8fecf3

jianoaix approved these changes May 13, 2022

clarkzinzow approved these changes May 14, 2022

bushshrub added 2 commits May 17, 2022 15:04

Merge branch 'ray-project:master' into master

4777439

Fixes failing test: test_parquet_read_spread

88d11db

ericl merged commit f61caa3 into ray-project:master May 17, 2022

maxpumperla pushed a commit that referenced this pull request May 18, 2022

Implement random_sample() (#24492)

85f9dc8

clarkzinzow pushed a commit to clarkzinzow/ray that referenced this pull request May 20, 2022

Implement random_sample() (ray-project#24492)

0a7597f
